By the end of this lab, you will:
1. Load and analyze the Lightcast dataset in a Spark DataFrame.
2. Create five easy and three medium-complexity visualizations using Plotly.
3. Explore salary distributions, employment trends, and job postings.
4. Analyze skills in relation to NAICS/SOC/ONET codes and salaries.
5. Customize colors, fonts, and styles in all visualizations (default themes result in a 2.5-point deduction).
6. Follow best practices for data communication in your reporting.
Step 1: Load the Dataset
!pip install pandas plotly pyspark kaleido
import pandas as pd
import plotly.express as px
import plotly.io as pio
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

pio.renderers.default = "vscode"

# Initialize Spark Session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .option("escape", "\"")
    .csv("./data/lightcast_job_postings.csv")
)

# Show Schema and Sample Data
df.printSchema()
df.show(5)
1 Salary Distribution by Employment Type
Identify salary trends across different employment types.
Filter the dataset
Remove records where salary is missing or zero.
Aggregate Data
Group by employment type and compute salary distribution.
Visualize results
Create a box plot where:
X-axis = EMPLOYMENT_TYPE_NAME
Y-axis = SALARY_FROM
Customize colors, fonts, and styles to avoid a 2.5-point deduction.
Explanation: Write two sentences about what the graph reveals.
import plotly.io as pio

pio.renderers.default = "notebook"

# Your Code for 1st question here
# SALARY_FROM > 0 also drops null salaries, since a null comparison is not true.
pdf = (
    df.filter((col("SALARY_FROM") > 0) & (col("EMPLOYMENT_TYPE_NAME").isNotNull()))
    .select("EMPLOYMENT_TYPE_NAME", "SALARY_FROM")
    .toPandas()
)

fig = px.box(
    pdf,
    x="EMPLOYMENT_TYPE_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Employment Type",
    color="EMPLOYMENT_TYPE_NAME",
    # With color=..., Plotly Express assigns each trace an explicit color,
    # so the palette is set here rather than through a layout colorway.
    color_discrete_sequence=["#ec7424", "#a4abab"],
)
fig.update_layout(
    template="plotly_white",
    font=dict(family="Helvetica Neue", size=16, color="#333"),
    title_font=dict(family="HelveticaNeue-CondensedBold", size=24),
)
fig.write_image("_output/salary_by_employment_type.svg")
fig.show()
This box plot compares the salary ranges across different employment types. Full-time positions (>32 hours) show a wider range and higher median salary compared to part-time roles.
Salary Distribution by Employment Type
2 Salary Distribution by Industry
Compare salary variations across industries.
Filter the dataset
Keep records where salary is greater than zero.
Aggregate Data
Group by NAICS industry codes.
Visualize results
Create a box plot where:
X-axis = NAICS2_NAME
Y-axis = SALARY_FROM
Customize colors, fonts, and styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 2nd question here
from pyspark.sql.functions import col
import plotly.express as px
import plotly.io as pio

industry_df = (
    df.filter((col("SALARY_FROM") > 0) & (col("NAICS2_NAME").isNotNull()))
    .select("NAICS2_NAME", "SALARY_FROM")
    .toPandas()
)

fig = px.box(
    industry_df,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Industry",
    color="NAICS2_NAME",
    # Palette set here because traces colored via color=... ignore a layout colorway.
    color_discrete_sequence=["#ec7424", "#a4abab"],
)
fig.update_layout(
    template="plotly_white",
    font=dict(family="Helvetica Neue", size=16, color="#333"),
    title_font=dict(family="HelveticaNeue-CondensedBold", size=24),
    xaxis_title="Industry",
    yaxis_title="Salary (From)",
    showlegend=False,
)
fig.write_image("_output/Salary Distribution by Industry.svg")
fig.show()
The chart reveals that industries like Information and Finance & Insurance have higher salary medians and wider ranges. In contrast, industries such as Retail Trade and Accommodation & Food Services offer lower salary ranges.
Salary Distribution by Industry
3 Job Posting Trends Over Time
Analyze how job postings fluctuate over time.
Aggregate Data
Count job postings per posted date (POSTED).
Visualize results
Create a line chart where:
X-axis = POSTED
Y-axis = Number of Job Postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
Job postings fluctuate over time, with clear peaks and troughs that may align with recruitment cycles or seasonal demand. The trend appears to be periodic, suggesting consistent hiring behavior.
Job Posting Trends Over Time
4 Top 10 Job Titles by Count
Identify the most frequently posted job titles.
Aggregate Data
Count the occurrences of each job title (TITLE_NAME).
Select the top 10 most frequent titles.
Visualize results
Create a bar chart where:
X-axis = TITLE_NAME
Y-axis = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
# Your code for 4th question here
from pyspark.sql.functions import count

top_titles_df = (
    df.filter(col("TITLE_NAME").isNotNull())
    .groupBy("TITLE_NAME")
    .agg(count("*").alias("JOB_COUNT"))
    .orderBy(col("JOB_COUNT").desc())
    .limit(10)
    .toPandas()
)

# Horizontal bars keep the long job titles readable.
fig = px.bar(
    top_titles_df,
    x="JOB_COUNT",
    y="TITLE_NAME",
    orientation="h",
    title="Top 10 Job Titles by Number of Postings",
    color="TITLE_NAME",
    # Palette set here because traces colored via color=... ignore a layout colorway.
    color_discrete_sequence=["#ec7424", "#a4abab"],
)
fig.update_layout(
    template="plotly_white",
    font=dict(family="Helvetica Neue", size=16, color="#333"),
    title_font=dict(family="HelveticaNeue-CondensedBold", size=24),
    xaxis_title="Job Count",
    yaxis_title="Job Title",
    showlegend=False,
)
fig.write_image("_output/Top 10 Job Titles by Count.svg")
fig.show()
Data Analysts are by far the most frequently posted job title, followed by Business Intelligence Analysts and Enterprise Architects. This suggests high demand for data-focused roles in the job market.
Top 10 Job Titles by Count
5 Remote vs On-Site Job Postings
Compare the proportion of remote and on-site job postings.
Aggregate Data
Count job postings by remote type (REMOTE_TYPE_NAME).
Visualize results
Create a pie chart where:
Labels = REMOTE_TYPE_NAME
Values = Job Count
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
The majority of job postings do not specify remote status, while a significant portion is clearly remote. Hybrid and non-remote roles account for smaller segments of the total postings.
Remote vs On-Site Job Postings
6 Skill Demand Analysis by Industry (Stacked Bar Chart)
Identify which skills are most in demand in various industries.
Aggregate Data
Extract skills from job postings.
Count occurrences of skills grouped by NAICS industry codes.
Visualize results
Create a stacked bar chart where:
X-axis = Industry
Y-axis = Skill Count
Color = Skill
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
The stacked bar chart illustrates the top skills demanded across the top 10 industries by skill count. It reveals that certain skills, such as SQL and data analysis, are consistently in demand across multiple sectors, while others are more industry-specific.
7 Salary Analysis by ONET Occupation Type (Bubble Chart)
Analyze how salaries differ across ONET occupation types.
Aggregate Data
Compute median salary for each occupation in the ONET taxonomy.
Visualize results
Create a bubble chart where:
X-axis = ONET_NAME
Y-axis = Median Salary
Size = Number of job postings
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
Only one ONET occupation type, Business Intelligence Analysts, is visible, with a median salary of around $88,000. This suggests either that the dataset covers a single ONET occupation or that the filtering was too strict.
Salary Analysis by ONET Occupation Type
8 Career Pathway Trends (Sankey Diagram)
Visualize job transitions between different occupation levels.
Aggregate Data
Identify career transitions between SOC job classifications.
Visualize results
Create a Sankey diagram where:
Source = SOC_2021_2_NAME
Target = SOC_2021_3_NAME
Value = Number of transitions
Apply custom colors and font styles.
Explanation: Write two sentences about what the graph reveals.
The Sankey diagram shows a strong transition flow from “Computer and Mathematical Occupations” to “Mathematical Science Occupations”. This implies a career trajectory trend from broad tech roles into more specialized scientific roles.
Career Pathway Trends
Conclusion & Insights
Hiring Trends: The number of job postings fluctuates significantly over time, indicating seasonal or campaign-based hiring patterns.
Top Roles: Data Analysts are the most frequently posted roles, suggesting high demand for analytical talent across industries.
Salary Variation: Salaries vary widely by employment type and industry, with full-time roles offering higher compensation on average.
Remote Work: Most postings do not specify remote type, but among those that do, remote and hybrid work options are gaining traction.
Skills in Demand: Skills like SQL, data visualization, and management appear across various industries, underscoring their broad utility.
Career Pathways: There are visible transitions from general occupational categories to more specialized ones, showing clear career development routes.